Fix regression bug in tutorial #2399
Conversation
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>>          for document in documents]
>>> texts = [
>>>     [word for word in document.lower().split() if word not in stoplist]
Do we want to demonstrate the best practice for string tokenization here?
The best practice is to use a dedicated library (e.g. NLTK)… do we want to complicate the tutorial?
Maybe we should.
But if we don't, this should be something really stupid (like it is now), so people don't accidentally copy-paste thinking it's a good idea.
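For concreteness, an untested sketch of what the NLTK route might look like (not part of this PR; assumes nltk is installed and its punkt tokenizer data has been downloaded):

# Hedged sketch: swap str.split() for NLTK's tokenizer.
# Requires `pip install nltk` and `nltk.download('punkt')` beforehand.
from nltk.tokenize import word_tokenize

texts = [[word for word in word_tokenize(document.lower()) if word not in stoplist]
         for document in documents]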
I was thinking more along the lines of https://radimrehurek.com/gensim/utils.html#gensim.utils.tokenize
WDYT?
OK, that's an option (but utils.simple_preprocess is better).
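Something along these lines, perhaps (untested sketch; simple_preprocess lowercases and tokenizes in one step, and by default drops tokens shorter than 2 characters):

# Hedged sketch: use gensim's own preprocessor instead of bare str.split().
from gensim.utils import simple_preprocess

texts = [[word for word in simple_preprocess(document) if word not in stoplist]
         for document in documents]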
>>> texts = [
>>>     [word for word in document.lower().split() if word not in stoplist]
>>>     for document in documents
>>> ]
>>>
>>> # remove words that appear only once
>>> frequency = defaultdict(int)
I know this isn't part of the PR, but it may be better to simplify this section as:
# untested code but "should" work
import collections
frequency = collections.Counter()
for text in texts:
    frequency.update(text)
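A Counter also plugs straight into the tutorial's next step, the filter that drops words appearing only once; an untested sketch:

# Untested sketch: Counter returns 0 for unseen tokens, so the lookup works as-is.
texts = [[token for token in text if frequency[token] > 1] for text in texts]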
Fine with me.
@@ -207,8 +214,11 @@ Similarly, to construct the dictionary without loading all texts into memory:
>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
Can we use a real corpus here? Also, we may want to show the best practice for string tokenization here.
That way, we can doctest this file.
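For example, an untested sketch that keeps the streaming behavior ('mycorpus.txt' is the tutorial's placeholder file; simple_preprocess stands in for the bare lower().split()):

# Hedged sketch: build the dictionary one line at a time, never holding
# all documents in memory.
from gensim import corpora, utils

dictionary = corpora.Dictionary(
    utils.simple_preprocess(line) for line in open('mycorpus.txt')
)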
Looks good, left some minor comments about potential improvements. If they're out of scope for the current PR, it may be good to capture them in a separate issue so that they get done eventually. Let me know what you think.
There's a ton of improvements we could make to the tutorials, but they're out of scope for this PR. I opened #2424 instead.
There is a minor bug in the Corpora and Vector Spaces tutorial: the import of corpora got removed in #2192 (I assume by accident). Identified by a confused user on our mailing list.
This PR reconstitutes the import, plus fixes the code style while I was at it.
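For reference, the restored line is gensim's standard corpora import, as used by the tutorial's snippets (e.g. corpora.Dictionary above):

# The import that #2192 accidentally dropped; later snippets reference
# corpora.Dictionary, so without it they fail with a NameError.
from gensim import corpora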